The purpose of the MetabolomicPipeline package is to provide additional tools to complement the analysis done by Metabolon. In this vignette we demonstrate how to use MetabolomicsPipeline package for:
Exploratory Analysis
Subpathway Analysis
Metabolite pairwise comparisons
Subpathway box plots and line plots
In this vignette, we will use data which consists of 86 samples (42 males, 44 females), three treatment groups, and the samples were taken at three different time points.
We receive multiple datasets from Metabolon, each on a different tab of the Metabolon Excel file. The datasets needed for the downstream analysis are the sample metadata, chemical annotation, raw peak data and log-transformed peak data. The table below shows the tab that each dataset is on within the Metabolon Excel file.
| Data | Tab of excel file |
|---|---|
| Sample metadata | Sample Meta Data |
| Chemical annotation | Chemical Annotation |
| Raw peak data | Peak Area Data |
| Log-transformed peak data | Log Transformed Data |
The MetabolomicsPipeline package provides a convenient way to load each of these datasets together as a SummarizedExperiment using loadMetabolon(). However, for this vignette we are using the demo data provided with the package. This data is accessed using MetabolomicsPipeline::demoDat. In the following chunk we:
Load the demo data
Create a table demostring the sample distribution
################################################################################
### Load Data ##################################################################
################################################################################
# Load Metabolon data from the Excel file
#dat <- loadMetabolon(path = "../data/Metabolon.xlsx")
dat = MetabolomicsPipeline::demoDat
################################################################################
### Create Table 1 #############################################################
################################################################################
# Create table 1
tbl1 <- table1(~ GROUP_NAME + TIME1| Gender
, data = colData(dat))
# Display table 1
tbl1| Female (N=44) |
Male (N=42) |
Overall (N=86) |
|
|---|---|---|---|
| GROUP_NAME | |||
| Control | 14 (31.8%) | 14 (33.3%) | 28 (32.6%) |
| treat1 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| treat2 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| TIME1 | |||
| End | 14 (31.8%) | 15 (35.7%) | 29 (33.7%) |
| Onset | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| PreSymp | 15 (34.1%) | 13 (31.0%) | 28 (32.6%) |
In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods of data exploration:
A.) Principal component analysis
B.) Heatmaps
In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component (PC) accounts for the largest amount of variance within the data. The second PC accounts for the largest remaining variance, and so on. Additionally, each of the PCs produced by PCA is uncorrelated with the other principal components. PCA can allow us to visualize sources of variation in the data. The metabolite_pca function will enable us to specify a sample metadata variable to label the points in the plot. The metabolite_pca function has three arguments:
data: SummarizedExperiment with Metabolon experiment data.
Assay: Name of the assay to be used for the pairwise analysis (default=‘normalized’)
meta_var: A metadata variable to color code the PCA plot by.
###############################################################################
### Run PCA ###################################################################
###############################################################################
# Define PCA label from metadata
meta_var = "Gender"
# Run PCA
pca <- metabolite_pca( dat,
meta_var = meta_var)
# Show heatmap
pcaSuppose you notice a variable with clearly separated groups that is not a variable of interest. In that case, consider stratifying your downstream analysis by the values of that variable. For example, we will stratify the downstream analysis by male/female in our vignette data.
For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log normalized peak values for each metabolite in each observation. We can group the observations by the experimental conditions. Grouping the experimental conditions in a heatmap is another way of visualizing sources of variation within our data.
We can use the metabolite_heatmap function to create the heatmaps, which requires the following arguments.
data: SummarizedExperiment with Metabolon experiment data.
top_mets: The number of metabolites to include in the heatmap. The metabolites are chosen based on the metabolites with the highest variability.
group_vars: The variables to group the samples by. caption: The title of the heatmap.
caption: A title for the heatmap. If strat_var is used, the title will automatically include the stratum with the tile.
Assay: Which assay data to use for the heatmap (default=“normalized”).
… : You can add additional arguments to order the samples
In the chunk below, we create a PCA plot labeled by Gender. Then, we make three heatmaps increasing by complexity.
################################################################################
### Run Heatmaps ###############################################################
################################################################################
# Heatmap with one group
treat_heatmap <- metabolite_heatmap(dat,top_mets = 50,
group_vars = "GROUP_NAME",
strat_var = NULL,
caption = "Heatmap Arranged By Group",
Assay = "normalized",
GROUP_NAME)
as.ggplot(treat_heatmap)
# Heatmap with two groups
treatandtime <- metabolite_heatmap(dat,top_mets = 50,
group_vars = c("GROUP_NAME","TIME1"),
strat_var = NULL,
caption = "Heatmap Arranged By Group and TIME",
Assay = "normalized",
GROUP_NAME, desc(TIME1))
as.ggplot(treatandtime)
# Heatmap with 2 group and stratified
strat_heat <- metabolite_heatmap(dat,top_mets = 50,
group_vars = c("GROUP_NAME","TIME1"),
strat_var = "Gender",
caption = "Heatmap Arranged By Group and TIME",
Assay = "normalized",
GROUP_NAME, desc(TIME1))
## Female
as.ggplot(strat_heat[[1]])In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each subpathway is within a superpathway. There are several metabolites within each subpathway and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a subpathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we utilize a combined Fisher probability test. The combined Fisher test combines the p-values from independent tests to test the hypothesis for an overall effect. The Combined Fisher Probability is helpful for testing a model at the subpathway level based on the pvalues from the model at the metabolite level.
We will test at the subpathway level by combining the p-values for each metabolite within the subpathway for each model. We use a combination function given by \(\tilde{X}\) which combines the pvalues, resulting in a chi-squared test statistic.
\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabolites in the subpathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\). You will notice that smaller p-values will lead to a larger \(\tilde{X}\).
Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,
Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.
Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.
Homoscedasticity: Equal variance between treatment groups.
In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumption between the metabolites within the subpathway.
For more about the Combined Fisher Probability and other methods that can address this problem, see:
Loughin, Thomas M. “A systematic comparison of methods for combining p-values from independent tests.” Computational statistics & data analysis 47.3 (2004): 467-485.
To test our hypothesis at the subpathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.
1.) Interaction: \(log Peak = Treatment Group + Time + Treatment*Time\)
2.) Parallel: \(log Peak = Treatment Group + Time\)
3.) Single: \(log Peak = Treatment\)
For the interaction model, we are focusing only on the interaction term “Treatment*Time” to test if there is a significant interaction between our treatment and the time variable. The parallel model is testing if the time variable is explaining a significant amount of the metabolite variance with treatment included, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.
We test at the subpathway level using the Combined Fisher Probability method to combine the p-values from each model for all metabolites within the subpathway. To run the subpathway analysis, we use the “subpathway_analysis” function, which requires the following arguments.
data: SummarizedExperiment with Metabolon experiment data.
treat_var: This is the name of the variable in the analysis data that is the main variable of interest.
block_var: This is the name of the blocking variable in the dataset. If the the experimental design does not include a blocking variable, then the value of block_var=NULL.
strat_var: Variable to stratify the subpathway analysis by. This is set to NULL by default and will not stratify the analysis unless specified.
Assay: Name of the assay to be used for the pairwise analysis (default=‘normalized’)
With the MetabolomicsPipeline package, we provide three different ways to summarize the results from the subpathway analysis.
Number of significant subpathways by model type (subpath_by_model)
Percentage of significant subpathways within superpathways (subpath_within_superpath)
Metabolite model results within a specified subpathway (met_within_sub)
################################################################################
## Stratified Analysis #########################################################
################################################################################
# Stratified Analysis
stratified = subpathway_analysis(dat,
treat_var = "GROUP_NAME",
block_var = "TIME1",
strat_var = "Gender",
Assay = "normalized")
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
################################################################################
### Results Plots ##############################################################
################################################################################
# 1. significant subpathways by model type
subpath_by_model(stratified)| Model Type | Female | Male |
|---|---|---|
| Interaction | 6 | 84 |
| Parallel | 20 | 12 |
| Single | 19 | 2 |
| None | 65 | 12 |
| Super Pathway | Percent Significant (Female) | |Percent Significant (Male |
|---|---|---|
| Xenobiotics | 5 / 5 (100%) | 5 / 5 (100%) |
| Amino Acid | 13 / 16 (81.25%) | 16 / 16 (100%) |
| Peptide | 3 / 5 (60%) | 3 / 5 (60%) |
| Nucleotide | 3 / 8 (37.5%) | 8 / 8 (100%) |
| Lipid | 16 / 53 (30.19%) | 46 / 53 (86.79%) |
| Carbohydrate | 2 / 8 (25%) | 7 / 8 (87.5%) |
| Cofactors and Vitamins | 2 / 11 (18.18%) | 9 / 11 (81.82%) |
| Energy | 0 / 2 (0%) | 2 / 2 (100%) |
| Partially Characterized Molecules | 0 / 1 (0%) | 1 / 1 (100%) |
# 3. Metabolites within subpathway
tables <- met_within_sub(stratified, subpathway = "Partially Characterized Molecules")
### Females
tables[[1]]| Metabolite Name | Interaction_pval P-Value | Parallel_pval P-Value | Single_pval P-Value |
|---|---|---|---|
| glycolate (hydroxyacetate) | 0.437 | 0.435 | 0.408 |
| iminodiacetate (IDA) | 0.936 | 0.373 | 0.017 |
| EDTA | 0.709 | 0.269 | 0.098 |
| ectoine | 0.833 | 0.742 | 0.572 |
| sulfate* | 0.596 | 0.880 | 0.858 |
| ethyl glucuronide | 0.293 | 0.891 | 0.329 |
| dimethyl sulfone | 0.703 | 0.350 | 0.908 |
| 3-acetylphenol sulfate | 0.533 | 0.033 | 0.018 |
| benzoylcarnitine* | 0.594 | 0.271 | 0.067 |
| S-(3-hydroxypropyl)mercapturic acid (HPMA) | 0.556 | 0.346 | 0.055 |
| O-sulfo-tyrosine | 0.558 | 0.632 | 0.086 |
| 3-hydroxypyridine sulfate | 0.470 | 0.163 | 0.001 |
| 6-hydroxyindole sulfate | 0.266 | 0.292 | 0.051 |
| 1,2,3-benzenetriol sulfate (2) | 0.208 | 0.097 | 0.000 |
| thioproline | 0.493 | 0.113 | 0.387 |
| 4-acetamidobenzoate | 0.069 | 0.026 | 0.084 |
| perfluorooctanesulfonate (PFOS) | 0.397 | 0.010 | 0.199 |
| methylnaphthyl sulfate (1)* | 0.443 | 0.670 | 0.065 |
| methylnaphthyl sulfate (2)* | 0.648 | 0.633 | 0.002 |
| 2,2’-methylenebis(6-tert-butyl-p-cresol) | 0.011 | 0.000 | 0.100 |
| 3-hydroxy-2-methylpyridine sulfate | 0.392 | 0.083 | 0.001 |
| 3,5-dichloro-2,6-dihydroxybenzoic acid | 0.442 | 0.002 | 0.253 |
| 3-bromo-5-chloro-2,6-dihydroxybenzoic acid* | 0.613 | 0.008 | 0.208 |
| 4-chlorobenzoic acid | 0.364 | 0.538 | 0.664 |
| perfluorohexanesulfonate (PFHxS) | 0.529 | 0.123 | 0.230 |
| Metabolite Name | Interaction_pval P-Value | Parallel_pval P-Value | Single_pval P-Value |
|---|---|---|---|
| glycolate (hydroxyacetate) | 0.054 | 0.069 | 0.181 |
| iminodiacetate (IDA) | 0.052 | 0.954 | 0.001 |
| EDTA | 0.116 | 0.987 | 0.005 |
| ectoine | 0.492 | 0.019 | 0.284 |
| sulfate* | 0.315 | 0.247 | 0.891 |
| ethyl glucuronide | 0.032 | 0.114 | 0.035 |
| dimethyl sulfone | 0.002 | 0.021 | 0.236 |
| 3-acetylphenol sulfate | 0.160 | 0.566 | 0.837 |
| benzoylcarnitine* | 0.378 | 0.272 | 0.055 |
| S-(3-hydroxypropyl)mercapturic acid (HPMA) | 0.033 | 0.027 | 0.113 |
| O-sulfo-tyrosine | 0.162 | 0.020 | 0.618 |
| 3-hydroxypyridine sulfate | 0.095 | 0.100 | 0.161 |
| 6-hydroxyindole sulfate | 0.064 | 0.069 | 0.498 |
| 1,2,3-benzenetriol sulfate (2) | 0.293 | 0.361 | 0.458 |
| thioproline | 0.004 | 0.931 | 0.371 |
| 4-acetamidobenzoate | 0.801 | 0.190 | 0.253 |
| perfluorooctanesulfonate (PFOS) | 0.058 | 0.122 | 0.387 |
| methylnaphthyl sulfate (1)* | 0.042 | 0.064 | 0.055 |
| methylnaphthyl sulfate (2)* | 0.033 | 0.013 | 0.067 |
| 2,2’-methylenebis(6-tert-butyl-p-cresol) | 0.006 | 0.000 | 0.419 |
| 3-hydroxy-2-methylpyridine sulfate | 0.258 | 0.095 | 0.369 |
| 3,5-dichloro-2,6-dihydroxybenzoic acid | 0.043 | 0.002 | 0.251 |
| 3-bromo-5-chloro-2,6-dihydroxybenzoic acid* | 0.044 | 0.020 | 0.135 |
| 4-chlorobenzoic acid | 0.933 | 0.343 | 0.617 |
| perfluorohexanesulfonate (PFHxS) | 0.026 | 0.486 | 0.049 |
We can look at the pairwise comparisons for all experimental groups at the metabolite level. Metabolon includes this as part of their analysis. However, if you need to change the model, you must rerun the pairwise analysis. We will use the metabolite_pairwise function within the MetabolomicsPipeline package, which requires the following arguments:
data: SummarizedExperiment with Metabolon experiment data.
Assay: Name of the assay to be used for the pairwise analysis (default=‘normalized’)
mets: Chemical ID for the metabolites of interest. If NULL then the pairwise analysis is completed for all metabololites.
form: This is a character string the resembles the right hand side of a simple linear regression model in R. For example form = “Group1 + Group2”.
strat_var: A variable in the analysis data to stratify the model by. If this is specified, a list of results will be returned.
We will produce a heatmap of the log fold changes for the metabolites with a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis). The heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater. The met_est_heatmap function will produce an interactive heatmap using the results from the pairwise analysis.
Similar to the pairwise estimate heatmap, we will produce a heatmap
where the heatmap will only include metabolites with a significant
overall p-value, and the values in the heat map will only be colored if
the pairwise comparison is significant. We use the
met_p_heatmap function to create an interactive p-value heatmap.
################################################################################
#### Run Pairwise Comparisons ##################################################
################################################################################
strat_pairwise = metabolite_pairwise(dat,form = "GROUP_NAME*TIME1",strat_var = "Gender")
###############################################################################
## Create Estimate Heatmap #####################################################
################################################################################
met_est_heatmap(strat_pairwise$Female, dat)Visualizations of the data can help us see the underlying trends. Two useful visualizations are boxplots and line plots, we will be using the subpathway_boxplots and subpathway_lineplots functions to create them. The main utility of these functions is it allows you for focus on the metabolites within a subpathway. For both functions, the arguments are:
data: SummarizedExperiment with Metabolon experiment data.
subpathway: Character value of the subpathway of interest. This is case sensitive and must be in the chemical annotation file.
block_var: This the the name of the variable in the meta data that is used for the X axis of the box plots. We recommend using the “block_var” from the subpathway analysis.
treat_var: This is a grouping variable. As a recommendation the treatment groups should be used in the treat_var argument as this will provide a different color for each of the treatments making it easier to identify.
Assay: Name of the assay to be used for the pairwise analysis (default=‘normalized’)
… : Additional arguments to filter the analysis data by.
################################################################################
### BoxPlots ###################################################################
################################################################################
subpathway_boxplots(dat, subpathway = "Lactoyl Amino Acid", block_var = TIME1,
treat_var = GROUP_NAME, Assay = "normalized",Gender =="Female")
################################################################################
## Line plots ##################################################################
################################################################################
# Set up data
dat$TIME1 <- as.numeric(factor(dat$TIME1,
levels = c("PreSymp","Onset","End")))
# Create line plots
subpathway_lineplots(dat, subpathway = "Lactoyl Amino Acid",
block_var = TIME1,treat_var = GROUP_NAME, Assay = "normalized",Gender=="Female" ) +
xlab("Time")
#> `geom_smooth()` using formula = 'y ~ x'